Jet: An Embedded DSL for High Performance Big Data Processing
نویسندگان
چکیده
Cluster computing systems today impose a trade-off between generality, performance and productivity. Hadoop and Dryad force programmers to write low level programs that are tedious to compose but easy to optimize. Systems like Dryad/LINQ and Spark allow concise modeling of user programs but do not apply relational optimizations. Pig and Hive restrict the language to achieve relational optimizations, making complex programs hard to express without user extensions. However, these extensions are cumbersome to write and disallow program optimizations. We present a distributed batch data processing framework called Jet. Jet uses deep language embedding in Scala, multi-stage programming and explicit side effect tracking to analyze the structure of user programs. The analysis is used to apply projection insertion, which eliminates unused data, as well as code motion and operation fusion to highly optimize the performance critical path of the program. The language embedding and a high-level interface allow Jet programs to be both expressive, resembling regular Scala code, and optimized. Its modular design allows users to extend Jet with modules that produce good performing code. Through a modular code generation scheme, Jet can generate programs for both Spark and Hadoop. Compared with näıve implementations we achieve 143% speedups on Spark and 126% on Hadoop.
منابع مشابه
Design and Test of the Real-time Text mining dashboard for Twitter
One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in dataset that is currently being produced at high speed, large volumes and with a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing the huge amount of data, today has become one of the concerns of data science scholar...
متن کاملDeclarative Processing of Semistructured Web Data
In order to give application programs access to data stored in the web in semistructured formats, in particular, in XML format, we propose a domain-specific language (DSL) for declarative processing such data. Our language is embedded in the functional logic programming language Curry and offers powerful matching constructs that enable a declarative description of accessing and transforming XML...
متن کاملDomain-specific optimization and inference
When writing computer software one is often forced to balance the need for high runtime performance with high programmer productivity. By using a high-level language it is often possible to cut development times, but this typically comes at the cost of reduced run-time performance. Using a lower-level language, programs can be made very efficient but at the cost of increased development time. R...
متن کاملHigh-Level GPU Programming: Domain-Specific Optimization and Inference
When writing computer software one is often forced to balance the need for high runtime performance with high programmer productivity. By using a high-level language it is often possible to cut development times, but this typically comes at the cost of reduced run-time performance. Using a lower-level language, programs can be made very efficient but at the cost of increased development time. R...
متن کاملUniting Language Embeddings for Fast and Friendly DSLs
The holy grail for a domain-specific language (DSL) is to be friendly and fast. A DSL should be friendly in the sense that it is easy to use by DSL end-users, and easy to develop by DSL authors. DSLs can be developed as entirely new compilers and ecosystems, which requires tremendous effort and often requires DSL authors to reinvent the wheel. Or, DSLs can be developed as libraries embedded in ...
متن کامل